Skip to content

Variable Grouped Swizzle#2914

Merged
vthumbe1503 merged 11 commits intoNVIDIA:mainfrom
int-smart:feat/variable_swizzle
Apr 30, 2026
Merged

Variable Grouped Swizzle#2914
vthumbe1503 merged 11 commits intoNVIDIA:mainfrom
int-smart:feat/variable_swizzle

Conversation

@int-smart
Copy link
Copy Markdown
Contributor

@int-smart int-smart commented Apr 22, 2026

Description

Grouped Swizzle with variable shape. Not sure if this is needed but if not can be closed.

Fixes #2451

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

int-smart and others added 3 commits April 21, 2026 23:54
… update C++ operator interface

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR lifts the hard restriction on variable-shape grouped swizzle by adding grouped_swizzle_scaling_variable_shape_kernel — a persistent-grid kernel that computes per-tensor scale offsets via a shared-memory warp reduction and a per-block linear scan over tensors — and routes swizzle_grouped_scaling_factors to it when !all_same_shape(). The C++ test path correctly exercises the new kernel. However, the PyTorch wrapper (maybe_swizzle_grouped_tensor) still does not forward first_dims/last_dims from the input to swizzle_input, so all_same_shape() always returns true there and the variable-shape kernel remains unreachable from Python; compute_padded_grouped_scale_shape also uses an averaged per_tensor_first_dim, which would produce a wrong-sized output allocation once the forwarding is fixed.

Confidence Score: 3/5

Not safe to merge as-is: the variable-shape kernel is unreachable from the PyTorch path, making the PR's stated feature non-functional end-to-end.

A pre-existing P1 finding (first_dims/last_dims not forwarded to swizzle_input in the PyTorch wrapper) means the entire new variable-shape kernel path is dead code from Python. The C++ kernel itself and the C-API test look structurally sound, but the integration gap means the feature cannot be exercised through the primary user-facing interface. Two P2 findings (misleading test comment, implicit size_t→int narrowing) do not lower the score further.

transformer_engine/pytorch/csrc/extensions/swizzle.cpp — first_dims/last_dims forwarding to swizzle_input and compute_padded_grouped_scale_shape fix required before the feature works end-to-end.

Important Files Changed

Filename Overview
transformer_engine/pytorch/csrc/extensions/swizzle.cpp Removes the hard rejection of variable-shape inputs but does not forward first_dims/last_dims to swizzle_input, so all_same_shape() always returns true and the new variable-shape kernel is unreachable from Python; additionally compute_padded_grouped_scale_shape uses the averaged per_tensor_first_dim, producing wrong output allocation for variable-shape inputs.
transformer_engine/common/swizzle/swizzle.cu Adds grouped_swizzle_scaling_variable_shape_kernel (persistent-grid, per-block shape scan) and wires it into swizzle_grouped_scaling_factors; reachable from C++ but not from the PyTorch wrapper due to unresolved first_dims/last_dims forwarding.
tests/cpp/operator/test_swizzle.cu New performTestGroupedSwizzleMXFP8Variable test correctly exercises the variable-shape kernel for cases 2/3/5/6; cases 1 and 4 route through the uniform kernel and do not test the new persistent-grid path despite the misleading comment on case 4.
tests/cpp/test_common.cu Correctly expands offsets to num_tensors+1 (sentinel style) and already sets kNVTEGroupedFirstDims/kNVTEGroupedLastDims when shapes differ; the gather_scales stride matches the kernel's padded_m*padded_k accumulation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["nvte_swizzle_grouped_scaling_factors"] --> B["swizzle_grouped_scaling_factors"]
    B --> C{is_variable_shape?}
    C -- "No (all_same_shape)" --> D["Uniform 3D-grid path\n(grouped_swizzle_{row,col}_scaling_uniform_shape_kernel)"]
    C -- "Yes (!all_same_shape)" --> E["Variable-shape persistent-grid path\n(grouped_swizzle_scaling_variable_shape_kernel)"]
    E --> F["Warp reduction → s_total_blocks\n(shared mem, block-local)"]
    F --> G["Persistent loop:\nlinear_block_id = blockIdx.x … total_blocks"]
    G --> H["Per-block linear scan over tensors\naccumulate current_scale_base"]
    H --> I["dispatch_swizzle_{row,col}_scaling_kernel_impl\n(vec_load_size ∈ {1,2,4})"]

    P["PyTorch: maybe_swizzle_grouped_tensor"] --> Q["Build swizzle_input WITHOUT\nfirst_dims / last_dims"]
    Q --> R["all_same_shape() = true always\n→ uniform kernel only"]
    R -. "variable-shape kernel\nnever reached" .-> E

    style R fill:#f88,stroke:#c00
    style E fill:#8f8,stroke:#060
Loading

Reviews (6): Last reviewed commit: "Merge branch 'main' into feat/variable_s..." | Re-trigger Greptile

Comment on lines +1891 to +1903
int device_id;
cudaGetDevice(&device_id);
int num_SMs;
cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
// Find out how many blocks of this specific kernel can fit on one SM
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
TB_DIM * TB_DIM, // block size
max_slm_size // dynamic shared memory
);
int persistent_blocks = num_SMs * max_active_blocks_per_sm;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unchecked CUDA API calls can silently produce zero-block launches

cudaGetDevice, cudaDeviceGetAttribute, and cudaOccupancyMaxActiveBlocksPerMultiprocessor are all called without NVTE_CHECK_CUDA. If any of these fail, max_active_blocks_per_sm is left with an indeterminate (or zero) value, making persistent_blocks = 0. Launching the persistent kernel with 0 blocks is legal in CUDA — it silently does nothing — so the output buffer stays uninitialized with no error raised.

Suggested change
int device_id;
cudaGetDevice(&device_id);
int num_SMs;
cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
// Find out how many blocks of this specific kernel can fit on one SM
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
TB_DIM * TB_DIM, // block size
max_slm_size // dynamic shared memory
);
int persistent_blocks = num_SMs * max_active_blocks_per_sm;
int device_id;
NVTE_CHECK_CUDA(cudaGetDevice(&device_id));
int num_SMs;
NVTE_CHECK_CUDA(cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id));
// Find out how many blocks of this specific kernel can fit on one SM
int max_active_blocks_per_sm;
NVTE_CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
TB_DIM * TB_DIM, // block size
max_slm_size // dynamic shared memory
));
NVTE_CHECK(max_active_blocks_per_sm > 0, "Occupancy query returned 0 blocks per SM.");
int persistent_blocks = num_SMs * max_active_blocks_per_sm;

Comment on lines +1720 to +1724
if (!is_variable_shape) {
// Fallback to uniform shape implementation
NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
"Grouped swizzle requires uniform tensor shapes.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Dead code: redundant assertions inside !is_variable_shape branch

is_variable_shape is defined as !input->all_same_shape(), so inside if (!is_variable_shape) the two NVTE_CHECK calls are tautologies — they can never fire. They add noise and could mislead future readers into thinking the branch can handle non-uniform shapes. Consider removing them or converting them to a comment.

Suggested change
if (!is_variable_shape) {
// Fallback to uniform shape implementation
NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
"Grouped swizzle requires uniform tensor shapes.");
if (!is_variable_shape) {
// All tensors share the same shape; use the optimised uniform-shape path.

if (int_stride % 2 != 0) int_stride++;
int* d_block_offsets = reinterpret_cast<int*>(workspace);
int* d_global_counter = d_block_offsets + num_tensors + 1;
int* d_total_blocks = d_global_counter + 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 d_total_blocks is written but never consumed

d_total_blocks is populated by compute_grouped_swizzle_setup (as *total_blocks = current_block_offset) but is never read by the persistent kernel or any host code afterward. The persistent grid terminates via the tensor_id == -1 sentinel, not via a stored total. If this field was intended as a diagnostic or future guard, a comment would clarify its purpose; otherwise it can be removed to avoid confusing future maintainers and wasting a device-side write.

Comment on lines +402 to +412
size_t num_tensors = input.num_tensors();
size_t num_int_elems = num_tensors + 3; // n+1 block_offsets + gc + tb
if (num_int_elems % 2 != 0) num_int_elems++; // pad to even for size_t alignment
size_t workspace_size = num_int_elems * sizeof(int) + (num_tensors + 1) * sizeof(size_t);
workspace_size = roundup(workspace_size, 256);
auto workspace =
allocateSpace(std::vector<size_t>{workspace_size}, transformer_engine::DType::kByte, false);

NVTE_SCOPED_GIL_RELEASE({
nvte_swizzle_grouped_scaling_factors(swizzle_input.data(), swizzle_output.data(),
at::cuda::getCurrentCUDAStream());
getDataPtr(workspace), at::cuda::getCurrentCUDAStream());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Workspace allocated unconditionally even for uniform-shape inputs

The workspace is only consumed by the variable-shape code path in swizzle_grouped_scaling_factors. For uniform shapes the pointer is accepted but immediately ignored. Gating the allocation on whether variable shapes are present (e.g., first_dims.data_ptr != nullptr || last_dims.data_ptr != nullptr) would avoid a small but unnecessary device allocation on every invocation with uniform tensors. This is a performance suggestion, not a correctness issue.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@vthumbe1503 vthumbe1503 self-requested a review April 22, 2026 22:33
cudaFuncAttributeMaxDynamicSharedMemorySize, max_slm_size));

int device_id;
cudaGetDevice(&device_id);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching the attrbues like number of sms and max active blocks per device would be ideal to reduce CPU overheads on each call.

We already have a function in transformer_engine/common/util/cuda_runtime.cpp called "sm_count". Could you please use that here?

// Fallback to uniform shape implementation
NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
"Grouped swizzle requires uniform tensor shapes.");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks might not be needed. Given we used input->all_same_shape() to reach this stage

Copy link
Copy Markdown
Collaborator

@vthumbe1503 vthumbe1503 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that workspace allocation + small kernel for computing offsets + persistent kernel might be an overkill for swizzling. @int-smart Do you have some performance numbers by any chance for the swizzling kernel on Blackwell?

How about we follow a SM filling grid pattern like in grouped_bias_add kernel in this PR?
https://github.com/NVIDIA/TransformerEngine/pull/2885/changes/BASE..b64559af9b89d816b8d7ffba4f5273e556d90c8e#diff-fa75cbeb11caf588f79b811be355c8f00b0cf5d4b807c259b94f2a40ffc8db6f

With this pattern thread block id is dynamically decided based on sum(first_dims) and at the same time we divide the rows of grouped_tensor uniformly among the SMs. However it only handles variable first_dims(Need to extend the idea for other cases like all dims being variable)

@vthumbe1503 vthumbe1503 self-assigned this Apr 22, 2026
@int-smart
Copy link
Copy Markdown
Contributor Author

@vthumbe1503 Will check the PR and get back

@int-smart
Copy link
Copy Markdown
Contributor Author

int-smart commented Apr 22, 2026

With regards to Blackwell I dont have the numbers tbh. I can generate it for RTX 40 series

int-smart and others added 2 commits April 25, 2026 15:55
@int-smart
Copy link
Copy Markdown
Contributor Author

int-smart commented Apr 25, 2026

@vthumbe1503 Consolidated to one kernel, removed shared memory allocation and tried to stick to the PR you mentioned. If this works let me know. Seems to perform better on rtx 4070 than my last approach. There are still some optimizations can be done but that would need more shared memory alloc.

int num_SMs;
cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add NVTE_CHECK_CUDA around these APIs.

@vthumbe1503
Copy link
Copy Markdown
Collaborator

Thanks for the PR @int-smart. Overall LGTM. Have a few minor comments w.r.t CPU overheads and error handling.

@vthumbe1503
Copy link
Copy Markdown
Collaborator

/te-ci

int-smart and others added 4 commits April 29, 2026 18:25
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
Comment on lines +2213 to +2220
const size_t padded_scale_elems = padded_m * padded_k;
const size_t compact_scale_elems =
rowwise ? m * padded_k : DIVUP(k, static_cast<size_t>(MXFP8_BLOCK_SIZE)) * padded_m;
const size_t compact_total_scale_elems =
rowwise ? round_up_to_multiple(input->num_tensors * m, 128) * padded_k
: round_up_to_multiple(
input->num_tensors * DIVUP(k, static_cast<size_t>(MXFP8_BLOCK_SIZE)), 4) *
padded_m;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Variable-shape kernel assumes padded-layout input; no compact-layout support

In the uniform path (lines 2213–2240), the code detects whether the input is in "compact" or "padded" layout and adjusts input_stride_bytes accordingly. The variable-shape kernel (grouped_swizzle_scaling_variable_shape_kernel) has no equivalent detection — current_scale_base accumulates padded_m * padded_k * scale_elem_size unconditionally, which is the padded stride. If the variable-shape input arrives in the compact layout (which the quantize kernel emits), reads for all tensors after the first will be offset to the wrong position in memory, producing corrupted scale factors with no error raised.

Comment thread tests/cpp/test_common.cu
Comment on lines 1210 to 1219
}

if (!same_first || !same_last) {
grouped.offsets_dev = cuda_alloc<int64_t>(num_tensors * sizeof(int64_t));
size_t num_off = num_tensors + 1;
grouped.offsets_dev = cuda_alloc<int64_t>(num_off * sizeof(int64_t));
NVTE_CHECK_CUDA(cudaMemcpy(grouped.offsets_dev.get(), offsets.data(),
num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice));
NVTEShape off_shape = nvte_make_shape(&num_tensors, 1);
num_off * sizeof(int64_t), cudaMemcpyHostToDevice));
NVTEShape off_shape = nvte_make_shape(&num_off, 1);
NVTEBasicTensor off_tensor{grouped.offsets_dev.get(), kNVTEInt64, off_shape};
nvte_set_grouped_tensor_param(h, kNVTEGroupedTensorOffsets, &off_tensor, sizeof(off_tensor));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 build_grouped_tensor sets offsets but not first_dims/last_dims — variable-shape test exercises uniform kernel

nvte_set_grouped_tensor_param is called for kNVTEGroupedTensorOffsets but never for kNVTEGroupedFirstDims or kNVTEGroupedLastDims. As a result, input->all_same_shape() returns true inside swizzle_grouped_scaling_factors even for the variable-shape test cases, so the new grouped_swizzle_scaling_variable_shape_kernel is never actually exercised. The test validates the uniform kernel with a wider variety of shapes rather than the new variable-shape code path it claims to cover.

kNVTEGroupedFirstDims and kNVTEGroupedLastDims need to be populated (analogous to how offsets are populated) for the variable-shape branch to be reached.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this comment is valid right @int-smart ?

@vthumbe1503
Copy link
Copy Markdown
Collaborator

/te-ci

Copy link
Copy Markdown
Collaborator

@vthumbe1503 vthumbe1503 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@vthumbe1503 vthumbe1503 merged commit a7a2b3b into NVIDIA:main Apr 30, 2026
2 checks passed
@int-smart int-smart deleted the feat/variable_swizzle branch May 1, 2026 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Grouped swizzle kernel

2 participants